## here() starts at /home/alex/3ieo/ieoproject-stubibi
In autoimmune type 1 diabetes (T1D), the gradual decline of functional β-cell mass initiates the onset of hyperglycemia and necessitates lifelong insulin therapy. However, early-stage dysregulation of glucagon secretion poses a significant challenge to effective disease management. This dysregulation is characterized by inadequate suppression during hyperglycemia and insufficient secretion during hypoglycemia. The specific mechanisms underlying this dysregulated glucagon secretion in T1D remain unclear. However, it is needed to highlight that both the parasympathetic and sympathetic nervous systems play crucial roles in regulating islet hormone secretion to maintain glycemic balance.
Laser microdissection of intact human islets followed by islet RNA-seq offers an initial step towards understanding sympathetic nervous system-related gene pathways without the inherent stress associated with islet and single-cell isolation procedures. All islets utilized in the study of Campbell-Thompson et al. (2021) are from donors without diabetes (ND) and donors with auto-antibodies (AAb) contained β-cells, whereas only a few residual INS-positive β-cells were found in rare islets from T1D donors. Hence, this model was also employed to investigate how the loss of paracrine insulin secretion could influence α-cell gene expression.
The resulting raw RNA-seq data have been deposited at the Gene Expression Omnibus (GEO), where they are publicly available under accession [GSE181674] (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE181674).
Here, we show a first exploratory analysis of the corresponding RNA-seq gene expression profiles generated as a table of counts using the DEE2 (https://dee2.io) pipeline by Ziemann, Kaspi, and El-Osta (2019), and further packaged into a [SummarizedExperiment] (https://bioconductor.org/packages/SummarizedExperiment) object with genes mapped to Entrez identifiers. This object also stores the phenotypic information about the profiled samples that has been also made available at GEO.
The question of interest to address in this work is whether gene expression differs in the islets from donors without diabetes (ND), donors with auto-antibodies (AAb), and donors with type 1 diabetes (T1D).
We start importing the raw table of counts from the GSE181674.rds file.
library(SummarizedExperiment)
se <- readRDS(file.path(system.file("extdata", package="IEOproject"), "GSE181674.rds"))
se
class: RangedSummarizedExperiment
dim: 25118 18
metadata(4): experimentData annotation ensemblVersion urlProcessedData
assays(1): counts
rownames(25118): 1 10 ... 9994 9997
rowData names(5): gene_id gene_biotype description gene_id_version
symbol
colnames(18): SRR15372438 SRR15372439 ... SRR15372454 SRR15372455
colData names(39): title geo_accession ... disease state:ch1 tissue:ch1
We have 25118 genes by 18 samples. From the first row and column names shown by the object, we can figure out that genes are defined by Entrez (Maglott et al. 2010) identifiers and samples by Sequence Read Archive Run (SRR) identifiers.
First, we explore our SummarizedExperiment object.
The rowData has information about the features profiled in the assay.
head(rowData(se))
DataFrame with 6 rows and 5 columns
gene_id gene_biotype description
<character> <character> <character>
1 ENSG00000121410 protein_coding alpha-1-B glycoprote..
10 ENSG00000156006 protein_coding N-acetyltransferase ..
100 ENSG00000196839 protein_coding adenosine deaminase ..
1000 ENSG00000170558 protein_coding cadherin 2 [Source:H..
10000 ENSG00000117020 protein_coding AKT serine/threonine..
100008586 ENSG00000236362 protein_coding G antigen 12F [Sourc..
gene_id_version symbol
<character> <character>
1 ENSG00000121410.11 A1BG
10 ENSG00000156006.4 NAT2
100 ENSG00000196839.12 ADA
1000 ENSG00000170558.8 CDH2
10000 ENSG00000117020.16 AKT3
100008586 ENSG00000236362.8 GAGE12F
colnames(rowData(se))
[1] "gene_id" "gene_biotype" "description" "gene_id_version"
[5] "symbol"
dim(rowData(se))
[1] 25118 5
We can see information about the genes: gene_id (ensembl id), gene_biotype, description, gene_id_version and symbol. Among these, the gene symbol and description are potentially useful for interpreting results of, for instance, a differential expression analysis.
The colData contains phenotypic information about the samples.
head(colData(se), n=3)
DataFrame with 3 rows and 39 columns
title geo_accession status submission_date
<character> <character> <character> <character>
SRR15372438 AAb_6267 GSM5509128 Public on Aug 09 2021 Aug 09 2021
SRR15372439 AAb_6424 GSM5509129 Public on Aug 09 2021 Aug 09 2021
SRR15372440 AAb_6429 GSM5509130 Public on Aug 09 2021 Aug 09 2021
last_update_date type channel_count source_name_ch1
<character> <character> <character> <character>
SRR15372438 Aug 10 2021 SRA 1 laser microdissected..
SRR15372439 Aug 10 2021 SRA 1 laser microdissected..
SRR15372440 Aug 10 2021 SRA 1 laser microdissected..
organism_ch1 characteristics_ch1 characteristics_ch1.1
<character> <character> <character>
SRR15372438 Homo sapiens tissue: laser microd.. disease state: non-d..
SRR15372439 Homo sapiens tissue: laser microd.. disease state: non-d..
SRR15372440 Homo sapiens tissue: laser microd.. disease state: non-d..
characteristics_ch1.2 molecule_ch1 extract_protocol_ch1
<character> <character> <character>
SRR15372438 biorep: 1 polyA RNA PicoPure RNA extract..
SRR15372439 biorep: 2 polyA RNA PicoPure RNA extract..
SRR15372440 biorep: 3 polyA RNA PicoPure RNA extract..
extract_protocol_ch1.1 extract_protocol_ch1.2 taxid_ch1
<character> <character> <character>
SRR15372438 Agilent 2100 Bioanal.. A sample of 6 ng RNA.. 9606
SRR15372439 Agilent 2100 Bioanal.. A sample of 6 ng RNA.. 9606
SRR15372440 Agilent 2100 Bioanal.. A sample of 6 ng RNA.. 9606
data_processing data_processing.1
<character> <character>
SRR15372438 Quality control perf.. Reads mapped to huma..
SRR15372439 Quality control perf.. Reads mapped to huma..
SRR15372440 Quality control perf.. Reads mapped to huma..
data_processing.2 data_processing.3
<character> <character>
SRR15372438 Gene expression quan.. A multi-dimensional ..
SRR15372439 Gene expression quan.. A multi-dimensional ..
SRR15372440 Gene expression quan.. A multi-dimensional ..
data_processing.4 data_processing.5 data_processing.6
<character> <character> <character>
SRR15372438 Empirical Bayes tech.. Network analysis and.. Genome_build: hg38
SRR15372439 Empirical Bayes tech.. Network analysis and.. Genome_build: hg38
SRR15372440 Empirical Bayes tech.. Network analysis and.. Genome_build: hg38
data_processing.7 data_processing.8
<character> <character>
SRR15372438 Supplementary_files_.. Supplementary_files_..
SRR15372439 Supplementary_files_.. Supplementary_files_..
SRR15372440 Supplementary_files_.. Supplementary_files_..
data_processing.9 platform_id data_row_count
<character> <character> <character>
SRR15372438 Supplementary_files_.. GPL21290 0
SRR15372439 Supplementary_files_.. GPL21290 0
SRR15372440 Supplementary_files_.. GPL21290 0
instrument_model library_selection library_source
<character> <character> <character>
SRR15372438 Illumina HiSeq 3000 cDNA transcriptomic
SRR15372439 Illumina HiSeq 3000 cDNA transcriptomic
SRR15372440 Illumina HiSeq 3000 cDNA transcriptomic
library_strategy relation relation.1
<character> <character> <character>
SRR15372438 RNA-Seq BioSample: https://w.. SRA: https://www.ncb..
SRR15372439 RNA-Seq BioSample: https://w.. SRA: https://www.ncb..
SRR15372440 RNA-Seq BioSample: https://w.. SRA: https://www.ncb..
supplementary_file_1 biorep:ch1 disease state:ch1
<character> <character> <character>
SRR15372438 NONE 1 non-diabetic islet a..
SRR15372439 NONE 2 non-diabetic islet a..
SRR15372440 NONE 3 non-diabetic islet a..
tissue:ch1
<character>
SRR15372438 laser microdissected..
SRR15372439 laser microdissected..
SRR15372440 laser microdissected..
colnames(colData(se))
[1] "title" "geo_accession" "status"
[4] "submission_date" "last_update_date" "type"
[7] "channel_count" "source_name_ch1" "organism_ch1"
[10] "characteristics_ch1" "characteristics_ch1.1" "characteristics_ch1.2"
[13] "molecule_ch1" "extract_protocol_ch1" "extract_protocol_ch1.1"
[16] "extract_protocol_ch1.2" "taxid_ch1" "data_processing"
[19] "data_processing.1" "data_processing.2" "data_processing.3"
[22] "data_processing.4" "data_processing.5" "data_processing.6"
[25] "data_processing.7" "data_processing.8" "data_processing.9"
[28] "platform_id" "data_row_count" "instrument_model"
[31] "library_selection" "library_source" "library_strategy"
[34] "relation" "relation.1" "supplementary_file_1"
[37] "biorep:ch1" "disease state:ch1" "tissue:ch1"
dim(colData(se))
[1] 18 39
We have a total of 39 phenotypic variables. The second column geo_accession contains GEO Sample Accession Number (GSM) identifiers. GSM identifiers define individual samples, understood in our context as individual sources of RNA. Our samples are not repeated, indicating that among the 18 samples we have not technical replicates:
ncol(se)
[1] 18
length(unique(se$geo_accession))
[1] 18
table(lengths(split(colnames(se), se$geo_accession)))
1
18
So, we have 18 different individual samples out of 18 total samples so we have not technical replicates.
To proceed further exploring this dataset, we are going to use the edgeR package and build a DGEList object, incorporating the gene metadata, which includes the gene symbol. We have to be sure that the dimensions of the DGEList object and the SummarizedExperiment object are the same.
library(edgeR)
dge <- DGEList(counts=assays(se)$counts, genes=rowData(se), samples=colData(se))
dim(dge)
[1] 25118 18
dim(dge)==dim(se)
[1] TRUE TRUE
Now, we calculate \(\log_2\) CPM units of expression and put them as an additional assay element to ease their manipulation.
assays(se)$logCPM <- cpm(dge, log=TRUE, prior.count=0.25)
assays(se)$logCPM[1:3, 1:6] #Visualization
SRR15372438 SRR15372439 SRR15372440 SRR15372441 SRR15372442 SRR15372443
1 -1.175019 -1.482126 -1.735242 -7.356596 -7.3565962 -2.893128
10 -7.356596 -7.356596 -7.356596 -7.356596 -7.3565962 -7.356596
100 2.050547 2.088666 1.659402 1.630366 0.3887662 1.389090
Let’s explore now some of the phenotypic variables.
Some of them provide information about the samples: title (ID for each sample), geo_accesion, type (SRA), relation, characteristics.
We have also information about the study such as status, submission_data, last_update_date.
Then we can see features related to the organism (Homo sapiens) in organism_ch1 and taxid_ch1 and to the tissue (laser microdissected islets) in tissue in which the analysis has been conducted. These features remain the same across all samples studied.
table(se$organism_ch1)
Homo sapiens
18
table(se$`tissue:ch1`)
laser microdissected islets
18
We can also identify some variables associated with technical factors, such as the extraction protocol (extract_protocol_ch1.X, molecule_ch1), RNA-seq experiment (platform_id, instrument_model, library_selection, library_source, library_strategy etc.) and data processing (data_processing.X) that remain the same across samples.
table(se$extract_protocol_ch1)
PicoPure RNA extraction kit (Thermofisher Scientific, cat# 12204-01) with genomic DNA removal using recombinant DNAse1 (Worthington Biochemical Corporation, cat# LS006361)
18
table(se$data_processing.1)
Reads mapped to human genome reference using Bowtie2.
18
Finally one feature we are particularly interested in is the disease state.
We can observe that for each category of the disease state (disease state:ch1),
there are several biological replicates (biorep:ch1):
table(se$`disease state:ch1`)
control non-diabetic
8
diabetic
7
non-diabetic islet autoantibody-positive individuals
3
To facilitate handling the variables disease state:ch1 and biorep:ch1 we are going to recode them
as follows.
se$disease_state <- as.factor(se$`disease state:ch1`)
levels(se$disease_state) <- c("ND", "T1D", "AAb")
dge$samples$disease_state <- as.factor(dge$samples$disease.state.ch1)
levels(dge$samples$disease_state) <- c("ND", "T1D", "AAb")
se$bioreplicate <- se$`biorep:ch1`
#Visualization
se$disease_state
[1] AAb AAb AAb ND ND ND ND ND ND ND ND T1D T1D T1D T1D T1D T1D T1D
Levels: ND T1D AAb
dge$samples$disease_state
[1] AAb AAb AAb ND ND ND ND ND ND ND ND T1D T1D T1D T1D T1D T1D T1D
Levels: ND T1D AAb
In Table 1 below, we show this variable jointly with the sample ID and the sample title (identifier used in the paper) and bioreplicate to try to gather as much understanding as possible on the underlying experimental design.
| Identifer | Sample Title | Disease State | Bioreplicate |
|---|---|---|---|
| SRR15372438 | AAb_6267 | non-diabetic islet autoantibody-positive individuals | 1 |
| SRR15372439 | AAb_6424 | non-diabetic islet autoantibody-positive individuals | 2 |
| SRR15372440 | AAb_6429 | non-diabetic islet autoantibody-positive individuals | 3 |
| SRR15372441 | ND_6131 | control non-diabetic | 1 |
| SRR15372442 | ND_6232 | control non-diabetic | 2 |
| SRR15372443 | ND_6234 | control non-diabetic | 3 |
| SRR15372444 | ND_6279 | control non-diabetic | 4 |
| SRR15372445 | ND_6293 | control non-diabetic | 5 |
| SRR15372446 | ND_6336 | control non-diabetic | 6 |
| SRR15372447 | ND_6339 | control non-diabetic | 7 |
| SRR15372448 | ND_6401 | control non-diabetic | 8 |
| SRR15372449 | T1D_6209 | diabetic | 1 |
| SRR15372450 | T1D_6228 | diabetic | 2 |
| SRR15372451 | T1D_6306 | diabetic | 3 |
| SRR15372452 | T1D_6342 | diabetic | 4 |
| SRR15372453 | T1D_6362 | diabetic | 5 |
| SRR15372454 | T1D_6371 | diabetic | 6 |
| SRR15372455 | T1D_6414 | diabetic | 7 |
When comparing three main groups for differential gene expression, our objective is to assess the differences in gene expression patterns between each pair of groups. This involves pairwise comparisons between the three groups. In Table 2 we can see how many bioreplicates we have for each condition.
| Disease state | Frequency |
|---|---|
| ND | 8 |
| T1D | 7 |
| AAb | 3 |
Let’s examine the sequencing depth in terms of total number of sequence read counts mapped to the genome per sample. Figure 1 below shows the sequencing depth per sample, also known as library sizes, in increasing order.
Figure 1: Library sizes in increasing order
We don’t see substantial differences in sequencing depth. The sequencing depth ranges from 35 to 48 million reads. Lower sequencing depth samples seem to be enriched in diabetic disease state whereas the other sample groups are not confounded with sequencing depth. Despite this slight difference, there are diabetic samples with higher sequencing depths so the distribution of the different disease states seem to be more or less random indicating no confusion with sequencing depth.
Figures 2 and 3 below show the distribution of expression values per sample in logarithmic CPM units of expression.
Figure 2: Non-parametric density distribution of expression profiles per sample
Figure 3: Boxplot of expression profiles per sample
We can see in Figure 2 the characteristic two peaks, one around negative values of \(\log_2\) CPM representing those genes that are not expressed, and the other around positive values representing those genes that are expressed. There are no substantial differences between the samples in the distribution of expression values. The samples show big overlap.
On the other hand, when depicting the boxplot (Figure 3) to assess differential distribution or variability, only one of the samples (T1D_6209) appears to exhibit a slightly different mean value of \(\log_2\) CPM, as depicted in Figure 3. However, this observation is not particularly noteworthy.
Let’s calculate now the average expression per gene through all the samples. Figure 4 shows the distribution of those values across genes.
Figure 4: Distribution of average expression level per gene
As expected, we have two modes, one for genes that are lowly expressed and another for genes which are more expressed in the samples. We will consider that the lowly expressed genes are the ones below 1 \(\log_2\) CPM levels (red line in Figure 4).
We filter lowly-expressed genes using the function filterByExpr(), grouping by sample-group to define the minimum number of samples in which a gene should be expressed.
mask <- filterByExpr(dge, group=se$disease_state)
se.filt <- se[mask, ]
dim(se.filt)
[1] 17415 18
dge.filt <- dge[mask, ]
dim(dge.filt)
[1] 17415 18
Figure 5: Distribution of average expression level per gene with filtered genes highlighted in red 1
We are left with 17415 genes in comparison to the 25118 genes we had at the beginning. We can visualize this in Figure 5, since the genes that are left after filtering lowly-expressed genes are highlighted in red.
With this initial filtering method, we fail to eliminate all genes with low expression levels (Figure 5). Many genes exhibit negative \(\log_2\) CPM levels. However, ultimately, we opted for a filtering cutoff requiring a minimum average of 1 \(\log_2\) CPM unit.
mask <- rowMeans(assays(se)$logCPM) > 1
se.filt <- se[mask, ]
dge.filt <- dge[mask, ]
dim(dge.filt)
[1] 12185 18
par(mar=c(4, 5, 1, 1))
Figure 6: Distribution of average expression level per gene with filtered genes highlighted in red 2
We are left with 12185 genes in comparison to the 25118 genes we had at the beginning (Figure 6). Note that after filtering with a minimum average of 1, we end up with less genes (12185) in comparison to the first method in which we define a minimum number of samples in which a gene should be expressed (17415). Both dgeList object and se object have been filtered.
We calculate now the normalization factors on the filtered expression data set. This is an application of the Trimmed Mean of M-values (TMM) method, that takes into consideration different RNA composition of the samples by estimating a scaling factor for each library.
dge.filt <- calcNormFactors(dge.filt)
Now we replace the raw \(\log_2\) CPM units in the corresponding assay element of the SummarizedExperiment object, by the normalized ones.
assays(se.filt)$logCPM <- cpm(dge.filt, log=TRUE, normalized.lib.sizes=TRUE)
We will examine now the MA-plots of the expression profiles before and after normalization in Figure 7.
Figure 7: MA-plots of expression values before and after filtering and normalisation
After filtering out genes with low expression levels and performing between-sample normalization, we eliminate sample-specific effects and biases in differential gene expression caused by low expression. As a result, the proportion of reads assigned to a particular gene in a sample is no longer influenced by the overall expression properties of the entire sample. Instead, it depends solely on the expression level of that gene. We discard counts at low values because ratios between small numbers can lead to inflated fold-changes, giving rise to the particular vuvuzela shape that we can see more clearly in the plots obtained before normalization. In general, fold-changes derived from high expression values are more trustworthy than those derived from low-expression values. As shown in the MA-plots of Figure 7 after normalization, the vuvuzela shape is reduced.
We will examine now the MA-plots of each sample compared to the average of the rest of the samples in order to detect any possible problematic samples.
Figure 8: MA-plots of filtered and normalized expression values for each sample AAb vs T1D
We can see in Figure 8 that a number of samples display some expression-level dependent bias. By visualizing the deviation of individual samples from the average expression across all samples, we can detect specific expression patterns in each individual and make conclusions.
Most of the samples exhibit a slight expression-level dependent bias at the low-end and/or high-end of the expression spectrum. However, this bias appears to be more pronounced in samples T1D_6362 and T1D_6371, prompting us to monitor other samples with similar biases for any additional unexpected features. Should such features arise, we may consider their removal from the analysis. Initially, we refrain from eliminating these samples due to the experiment’s limited sample size and the satisfactory quality of other quality control measures.
In instances where this bias manifests at the low end of the expression spectrum, one potential solution could involve implementing a more stringent filter on minimum expression levels. However, upon testing with higher thresholds, we observe a persistent effect and risk discarding numerous genes that could prove valuable for subsequent comparisons.
Furthermore, we note the absence of discernible patterns between samples of the same category.
Here we try to understand the underlying experimental design.
First of all, we can do an a multi-dimensional scaling (MDS) plot between the disease state to see if there is a grouping between samples and to detect possible outliers. As we can see in Figure 9, samples do not show a very good grouping by disease state. The MDS plot suggests that AAb gene expression values appear to lie largely intermediate to those of both ND and T1D that are more separated between them. We can detect for one outlier of a T1D donor. We elected not to exclude this donor based on the additional QC parameters which demonstrated that expression values were highly consistent across the range of expression for all samples as mentioned by the authors of the paper.
Figure 9: Multidimensional scaling plot (MDS) and hierarchical clustering by gene expression of the samples
Labels correspond to DiseaseState_sampleID and colors indicate disease state.
In the supplementary data, additional phenotypic information about the samples is available, which is not included in the SummarizedExperiment object. To explore the possibility of other sample groupings or potential confounding variables, we intend to augment the colData dataframe of the object with this phenotypic data. Among these variables, we have identified potential confounding factors such as sex, cause of death, BMI, and age. While race could also be considered a potential confounder, the majority of donors in the study were Caucasian, limiting its utility in this context. Since age and BMI are continuous variables, we have decided to define ranges of values for a clearer analysis.
sex <- c("female","male","male","male","female","female","male","female","female","male","female","female","male","male","female","male","female","male")
cause_death <- c("Anoxia", "Head Trauma", "Head Trauma", "Anoxia", "Head Trauma", "Head Trauma", "Head Trauma", "Anoxia", "Head Trauma", "Head Trauma", "Head Trauma", "Cerebral Edema", "Anoxia", "Head Trauma", "Anoxia", "Head Trauma", "Cerebral Edema", "Anoxia")
age <- c(23.0, 17.65, 22.10, 24.2, 14.0, 20.0, 19.0, 9, 14.3, 23.3, 25.07, 5.0, 13.0, 19.0, 14.0, 24.9, 12.5, 23.1)
bmi <- c(18.60, 20.80, 28.90, 34.00, 25.60, 25.00, 24.80, 31.30, 51.40, 19.60,
23.50, 15.90, 16.60, 17.40, 24.30, 24.50, 28.40, 28.50)
colnames(dge.filt)<-dge.filt$samples$title
colData(se.filt)$sex <- as.factor(sex)
colData(se.filt)$cause_death <- as.factor(cause_death)
colData(se.filt)$age <- as.numeric(age)
colData(se.filt)$bmi <- as.numeric(bmi)
dge.filt$samples$sex <- as.factor(sex)
dge.filt$samples$cause_death <- as.factor(cause_death)
dge.filt$samples$age <- as.numeric(age)
dge.filt$samples$bmi <- as.numeric(bmi)
Figure 10: Multidimensional scaling plot (MDS) of the samples
Labels correspond to DiseaseState_sampleID and colors indicate the different potential confounding variables
From Multidimensional Scaling Plots (MDS) of figure 9 we can conclude that with respect to sex, death cause, BMI and age we cannot find any substantial differences or grouping behavior between samples. This leads us to the conclusion that there is no confounding due to these features.
Figure 11: Hierarchical clustering of the samples
Labels correspond to DiseaseState_sampleID and colors indicate the different potential confounding variables.
According to the previous results, in Figure 11 depicting the hierarchical clusters of the samples colored by the different variables studied, we observe that there is no clustering based on sex, death cause or BMI. However, we have noticed some kind of grouping based on age as the younger samples are on the left and the older on the right.
We examined how samples group together by multidimensional scaling and hierarchical clustering, annotating disease state and sex, BMI, age and cause of death. We calculate again log CPM values with a high prior count(3) to moderate extreme fold-changes produced by low counts.
In the MDS plots shown in Figure 10, a conspicuous clustering pattern is observed, characterized by a predominant cluster encompassing the majority of samples, followed by 3 outliers positioned distinctly apart from the main cluster. Particularly noteworthy is the sample labeled T1D_6109, corresponding to the youngest individual in the dataset, aged 5 years, as indicated in the MDS plot by age. Notably, there is no apparent clustering pattern based on any examined variable. The presence of this distinct outlier, specifically the sample from the 5-year-old child, suggests a significant deviation from the overall expression profiles observed in the dataset, such deviation warrant further investigation to elucidate underlying factors contributing to the observed separation, potentially shedding light on age-related expression patterns or technical artifacts influencing the analysis.
In the hierarchical clustering analysis annotated by sex (see Figure 11), no distinct clustering pattern emerges based on this variable. This observation suggests that the samples do not segregate into discernible groups based on gender. On the other hand, in the hierarchical clustering analysis annotated by BMI (see Figure 11), while there is clustering observed, it is not prominently delineated by this variable alone. Notably, the samples primarily cluster based on BMI, with a notable concentration of underweight samples within a distinct group. However, other BMI categories do not exhibit clear, separate clusters, suggesting that additional factors may contribute to the overall clustering pattern observed.
In the hierarchical clustering analysis annotated by age (see Figure 11), a discernible grouping of samples is evident. This pronounced clustering pattern suggests that age could indeed be a confounding variable in the dataset analysis. The presence of clearly delineated age-based clusters implies that age-related biological or developmental differences may significantly influence the expression profiles observed in the samples. Consequently, failing to account for age as a confounding variables in the analysis could potentially introduce bias or confound the interpretation of results. Therefore, while age should be carefully considered and appropriately addressed as a potential confounding variables, additional validation and exploration are warranted to confirm its true impact on the clustering patterns observed.
We perform a simple assessment of the extent of expression changes and their associated p-values using the F-test implemented in the R/Bioconductor package sva. We compare T1D with AAb, T1D with ND and ND with AAb samples. We first subset the data as follows:
# Subset for T1D vs AAb
#T1D vs AAb
se.filt_T1D_AAb <- se.filt[, se.filt$disease_state %in% c("T1D", "AAb")]
se.filt_T1D_AAb$disease_state <- droplevels(se.filt_T1D_AAb$disease_state)
#AAb vs ND
se.filt_AAb_ND <- se.filt[, se.filt$disease_state %in% c("ND", "AAb")]
se.filt_AAb_ND$disease_state <- droplevels(se.filt_AAb_ND$disease_state)
#ND vs T1D
se.filt_ND_T1D <- se.filt[, se.filt$disease_state %in% c("T1D", "ND")]
se.filt_ND_T1D$disease_state <- droplevels(se.filt_ND_T1D$disease_state)
In the second step above, we dropped the unused levels from the disease_state factor variable. This is important to avoid using factor levels that do not exist in this subset of the data. We build now the corresponding full and null model matrices.
# Subset for T1D vs AAb
mod_T1D_AAb <- model.matrix(~ disease_state, colData(se.filt_T1D_AAb))
mod0_T1D_AAb <- model.matrix(~ 1, colData(se.filt_T1D_AAb))
# Subset for AAb vs ND
mod_AAb_ND <- model.matrix(~ disease_state, colData(se.filt_AAb_ND))
mod0_AAb_ND <- model.matrix(~ 1, colData(se.filt_AAb_ND))
# Subset for ND vs T1D
mod_ND_T1D <- model.matrix(~ disease_state, colData(se.filt_ND_T1D))
mod0_ND_T1D <- model.matrix(~ 1, colData(se.filt_ND_T1D))
Finally, we conduct the F-test implemented in the package sva and examine the amount of differential expression between pairwise disease_states.
library(sva)
# T1D vs AAb
pv_T1D_AAb <- f.pvalue(assays(se.filt_T1D_AAb)$logCPM, mod_T1D_AAb, mod0_T1D_AAb)
sum(p.adjust(pv_T1D_AAb, method="fdr") < 0.05)
[1] 0
sum(p.adjust(pv_T1D_AAb, method="fdr") < 0.1)
[1] 0
# AAb vs ND
pv_AAb_ND <- f.pvalue(assays(se.filt_AAb_ND)$logCPM, mod_AAb_ND, mod0_AAb_ND)
sum(p.adjust(pv_AAb_ND, method="fdr") < 0.05)
[1] 0
sum(p.adjust(pv_AAb_ND, method="fdr") < 0.1)
[1] 0
# ND vs T1D
pv_ND_T1D <- f.pvalue(assays(se.filt_ND_T1D)$logCPM, mod_ND_T1D, mod0_ND_T1D)
sum(p.adjust(pv_ND_T1D, method="fdr") < 0.05)
[1] 230
sum(p.adjust(pv_ND_T1D, method="fdr") < 0.1)
[1] 841
No DE genes were found between T1D and AAb or ND and AAB. For the comparison between ND and T1D, we identified 230 DE genes at FDR < 5%, but 841 DE genes at FDR < 10%. The distribution of the resulting p-values is shown in Figure 12.
Figure 12: Distribution of raw p-values for an F-test on every gene between T1D vs AAb, AAb vs ND and ND vs T1D samples
We can see in Figure 12 the distribution of p-values of the comparisons. Under the null hypothesis should be uniform. Assuming that most genes are not differentially expressed (DE), the histogram should have looked uniform with a peak at low p-values for the truly DE genes. Departures from this shape indicate within sample group heterogeneity that may need to be adjusted.
Now we adjust for age as known confounding effect and we adjust for unknown sources of variation with SVA. Then we conduct again the F-test implemented in the package sva and examine the amount of differential expression between pairwise disease_states.
# Subset for T1D vs AAb
mod_T1D_AAb <- model.matrix(~ disease_state + age, colData(se.filt_T1D_AAb))
mod0_T1D_AAb <- model.matrix(~ age, colData(se.filt_T1D_AAb))
# Subset for AAb vs ND
mod_AAb_ND <- model.matrix(~ disease_state + age, colData(se.filt_AAb_ND))
mod0_AAb_ND <- model.matrix(~ age, colData(se.filt_AAb_ND))
# Subset for ND vs T1D
mod_ND_T1D <- model.matrix(~ disease_state + age, colData(se.filt_ND_T1D))
mod0_ND_T1D <- model.matrix(~ age, colData(se.filt_ND_T1D))
library(sva)
sv1 <- sva(assays(se.filt_T1D_AAb)$logCPM, mod_T1D_AAb, mod0_T1D_AAb)
Number of significant surrogate variables is: 2
Iteration (out of 5 ):1 2 3 4 5
names(sv1)
[1] "sv" "pprob.gam" "pprob.b" "n.sv"
sv2 <- sva(assays(se.filt_AAb_ND)$logCPM, mod_AAb_ND, mod0_AAb_ND)
Number of significant surrogate variables is: 2
Iteration (out of 5 ):1 2 3 4 5
names(sv2)
[1] "sv" "pprob.gam" "pprob.b" "n.sv"
sv3 <- sva(assays(se.filt_ND_T1D)$logCPM, mod_ND_T1D, mod0_ND_T1D)
Number of significant surrogate variables is: 3
Iteration (out of 5 ):1 2 3 4 5
names(sv3)
[1] "sv" "pprob.gam" "pprob.b" "n.sv"
#For SV1
modSV1 <- cbind(mod_T1D_AAb, sv1$sv)
mod0SV1 <- cbind(mod0_T1D_AAb, sv1$sv)
pvalsSV1 <- f.pvalue(assays(se.filt_T1D_AAb)$logCPM, modSV1, mod0SV1)
sum(p.adjust(pvalsSV1, method="fdr") < 0.05)
[1] 0
#For SV2
modSV2 <- cbind(mod_AAb_ND, sv2$sv)
mod0SV2 <- cbind(mod0_AAb_ND, sv2$sv)
pvalsSV2 <- f.pvalue(assays(se.filt_AAb_ND)$logCPM, modSV2, mod0SV2)
sum(p.adjust(pvalsSV2, method="BH") < 0.05)
[1] 0
#For SV3
modSV3 <- cbind(mod_ND_T1D, sv3$sv)
mod0SV3 <- cbind(mod0_ND_T1D, sv3$sv)
pvalsSV3 <- f.pvalue(assays(se.filt_ND_T1D)$logCPM, modSV3, mod0SV3)
sum(p.adjust(pvalsSV3, method="BH") < 0.05)
[1] 315
Adjusting for surrogate variables (SVs) in gene expression analysis did not reveal any differentially expressed genes between pair of conditions except for 315 DE genes between T1D and ND samples, higher than the 230 identified previously at FDR < 5%.
Figure 13: Distribution of raw p-values for an F-test on every gene between T1D vs AAb, AAb vs ND and ND vs T1D samples
We can see that the histograms of the p-value distribution are more uniform now (Figure 13).
Now we build a table with the differentially expressed (DE) genes having an FDR < 10%, only for the comparison T1D and ND that is the only one that has any DE gene adjusting for surrogate variables (SV).
Finally for the comparison between ND and T1D. We obtain 841 DE genes at FDR < 10%. Below, we present the top-10 genes with the lowest p-values in Table 3.
| EntrezID | Symbol | Description | P value | |
|---|---|---|---|---|
| 1 | 3134 | HLA-F | major histocompatibility complex, class I, F | 1.00e-07 |
| 2 | 4495 | MT1G | metallothionein 1G | 1.30e-06 |
| 3 | 10670 | RRAGA | Ras related GTP binding A | 1.70e-06 |
| 4 | 9844 | ELMO1 | engulfment and cell motility 1 | 1.70e-06 |
| 5 | 6726 | SRP9 | signal recognition particle 9 | 2.30e-06 |
| 6 | 23328 | SASH1 | SAM and SH3 domain containing 1 | 2.40e-06 |
| 7 | 972 | CD74 | CD74 molecule | 6.60e-06 |
| 8 | 5861 | RAB1A | RAB1A, member RAS oncogene family | 1.06e-05 |
| 9 | 90987 | ZNF251 | zinc finger protein 251 | 1.58e-05 |
| 10 | 3033 | HADH | hydroxyacyl-CoA dehydrogenase | 1.78e-05 |
In order to analyze the differential expression (DE) between the three conditions, we will use the Bioconductor package limma, which implements procedures to carry out DE analyses on RNA-seq data implementing the mean-variance relationship into linear regression models. We have analyzed DE without taking into account covariates, taking into account age as a confounding variable and also taking into account age and unknown covariates. Also we have used both the limma-trend and limma-voom pipelines. The results were more or less similar but in our case, the dataset consist of a relatively small number of replicates (8 ND, 7 T1D, and 3 AAb) and minimal variation in library sizes, so we will use limma-trend pipeline.
# T1D vs AAb
se.filt_T1D_AAb$disease_state <- relevel(se.filt_T1D_AAb$disease_state, ref="T1D")
mod1 <- model.matrix(~ disease_state + age, data=colData(se.filt_T1D_AAb))
mod0.1 <- model.matrix(~age, data=colData(se.filt_T1D_AAb))
sv1 <- sva(assays(se.filt_T1D_AAb)$logCPM, mod=mod1, mod0=mod0.1)
Number of significant surrogate variables is: 2
Iteration (out of 5 ):1 2 3 4 5
mod1 <- cbind(mod1, sv1$sv)
colnames(mod1) <- c(colnames(mod1)[1:3], paste0("SV", 1:sv1$n))
head(mod1, 3)
(Intercept) disease_stateAAb age SV1 SV2
SRR15372438 1 1 23.00 0.2723930 -0.1973451
SRR15372439 1 1 17.65 0.1995298 -0.1710604
SRR15372440 1 1 22.10 0.2350359 -0.1471492
head(rowData(se.filt_AAb_ND))
DataFrame with 6 rows and 5 columns
gene_id gene_biotype description gene_id_version
<character> <character> <character> <character>
100 ENSG00000196839 protein_coding adenosine deaminase .. ENSG00000196839.12
1000 ENSG00000170558 protein_coding cadherin 2 [Source:H.. ENSG00000170558.8
10000 ENSG00000117020 protein_coding AKT serine/threonine.. ENSG00000117020.16
10001 ENSG00000133997 protein_coding mediator complex sub.. ENSG00000133997.11
10003 ENSG00000077616 protein_coding N-acetylated alpha-l.. ENSG00000077616.10
10004 ENSG00000168060 protein_coding N-acetylated alpha-l.. ENSG00000168060.15
symbol
<character>
100 ADA
1000 CDH2
10000 AKT3
10001 MED6
10003 NAALAD2
10004 NAALADL1
#Fit the model
fit1 <- lmFit(assays(se.filt_T1D_AAb)$logCPM, mod1)
#Calculate moderate t statistics
fit1 <- eBayes(fit1, trend=TRUE)
#Add gene metadata
genesmd <- data.frame(chr=as.character(seqnames(rowRanges(se.filt_T1D_AAb))),
symbol=rowData(se.filt_T1D_AAb)[, 5],
stringsAsFactors=FALSE)
fit1$genes <- genesmd
tt1 <- topTable(fit1, coef=2, n=Inf)
DEgenes1 <- rownames(tt1)[tt1$adj.P.Val < 0.1]
length(DEgenes1)
[1] 0
# AAb vs ND
se.filt_AAb_ND$disease_state <- relevel(se.filt_AAb_ND$disease_state, ref = "AAb")
mod2 <- model.matrix(~ disease_state + age, data = colData(se.filt_AAb_ND))
mod0.2 <- model.matrix(~ age, data = colData(se.filt_AAb_ND))
sv2 <- sva(assays(se.filt_AAb_ND)$logCPM, mod = mod2, mod0 = mod0.2)
Number of significant surrogate variables is: 2
Iteration (out of 5 ):1 2 3 4 5
mod2 <- cbind(mod2, sv2$sv)
colnames(mod2) <- c(colnames(mod2)[1:3], paste0("SV", 1:sv2$n))
head(mod2, 3)
(Intercept) disease_stateND age SV1 SV2
SRR15372438 1 0 23.00 -0.2577965 -0.1856249
SRR15372439 1 0 17.65 -0.1373232 0.5120628
SRR15372440 1 0 22.10 -0.3087993 0.2482733
# Fit the model
fit2 <- lmFit(assays(se.filt_AAb_ND)$logCPM, mod2)
# Calculate moderated t statistics
fit2 <- eBayes(fit2, trend = TRUE)
# Add gene metadata
genesmd2 <- data.frame(chr = as.character(seqnames(rowRanges(se.filt_AAb_ND))),
symbol = rowData(se.filt_AAb_ND)[, 5],
stringsAsFactors = FALSE)
fit2$genes <- genesmd2
tt2 <- topTable(fit2, coef = 2, n = Inf)
DEgenes2 <- rownames(tt2)[tt2$adj.P.Val < 0.1]
length(DEgenes2)
[1] 2
# ND vs T1D
se.filt_ND_T1D$disease_state <- relevel(se.filt_ND_T1D$disease_state, ref = "ND")
mod3 <- model.matrix(~ disease_state + age, data = colData(se.filt_ND_T1D))
mod0.3 <- model.matrix(~ age, data = colData(se.filt_ND_T1D))
sv3 <- sva(assays(se.filt_ND_T1D)$logCPM, mod = mod3, mod0 = mod0.3)
Number of significant surrogate variables is: 3
Iteration (out of 5 ):1 2 3 4 5
mod3 <- cbind(mod3, sv3$sv)
colnames(mod3) <- c(colnames(mod3)[1:3], paste0("SV", 1:sv3$n))
head(mod3, 3)
(Intercept) disease_stateT1D age SV1 SV2 SV3
SRR15372441 1 0 24.2 -0.1916294 -0.2056604 -0.05426709
SRR15372442 1 0 14.0 -0.0116619 0.2596077 0.23688536
SRR15372443 1 0 20.0 -0.2585012 -0.1912782 -0.12645233
# Fit the model
fit3 <- lmFit(assays(se.filt_ND_T1D)$logCPM, mod3)
# Calculate moderated t statistics
fit3 <- eBayes(fit3, trend = TRUE)
# Add gene metadata
genesmd3 <- data.frame(chr = as.character(seqnames(rowRanges(se.filt_ND_T1D))),
symbol = rowData(se.filt_ND_T1D)[, 5],
stringsAsFactors = FALSE)
fit3$genes <- genesmd3
tt3 <- topTable(fit3, coef = 2, n = Inf)
DEgenes3 <- rownames(tt3)[tt3$adj.P.Val < 0.1]
length(DEgenes3)
[1] 2019
There are no DE genes between T1D and AAb samples whereas We identified 2 DE genes at FDR < 10% between AAb and ND patients and 2019 DE genes at FDR < 10% between ND and T1D samples. The distribution of the resulting p-values is shown in Figure 14.
Figure 14: Distribution of raw p-values and qq plots of the moderated t statistics for the test on every gene between T1D vs AAb, AAb vs ND and ND vs T1D samples with limma-trend adjusting for known and unknown variables
We can observe in Figure 14 that the distribution of the raw p-values is the expected, since we see the maximum peak at the left end of the distribution, and the rest is more or less uniform in ND vs T1D. On the other side we don’t see that peak in T1D vs AAb and AAb vs ND where distributions are more or less uniform. This is consistent with the fact that the comparison T1D vs AAb has no DE genes and AAb vs ND only two at FDR < 10%.
In the case of the QQplot, we can see that we have not any/many DE genes between AAb and T1D or AAb vs ND as the points in the QQ plot lie close to the reference line (the theoretical line), it indicates that the observed t-statistics follow the expected distribution under the null hypothesis. This suggests that there are no substantial deviations, and the null hypothesis (no differential expression) is mostly true for these comparisons.
In the case of ND vs T1D, in this comparison, the QQ plot shows significant deviations from the theoretical line at both extremes (negative and positive values). This deviation indicates that the observed t-statistics are more extreme than what would be expected under the null hypothesis. A downward deviation at the lower end suggests an excess of negative t-statistics, indicating more genes are significantly downregulated in ND compared to T1D whereas an upward deviation at the upper end suggests an excess of positive t-statistics, indicating more genes are significantly upregulated in ND compared to T1D.
Moreover, we can use volcano plots to examine the extent of DE for a contrast of interest.
Figure 15: Volcano Plots showing DE genes, with upregulated and downregulated genes in the right and left respectively
We can se that no DE genes were found between AAb and T1D in Figure 15. In the case of ND vs AAb, there is only two DE genes, one downregulated and the other upregulated. On the other hand, more than 2000 genes have a differential expression in the ND vs T1D condition. In Figure 14 we can see the top 10 genes names.
Our next goal is to contextualize the differentially expressed genes identified in the previous section to determine which biological processes are involved. A common method for functional enrichment analysis involves conducting a Gene Ontology (GO) analysis, which typically entails applying the one-tailed Fisher’s exact test to each gene set in the GO database. It will be done only for the conditions T1D vs ND as the others present a few or no DE genes.
library(org.Hs.eg.db)
library(GOstats)
# Define geneUniverse as all genes
geneUniverse <- rownames(se)
# ND vs T1D (tt3)
# Build parameters object
params3 <- new("GOHyperGParams", geneIds=DEgenes3,
universeGeneIds=geneUniverse,
annotation="org.Hs.eg.db", ontology="BP",
pvalueCutoff=0.05, testDirection="over")
#Run functional enrichment
hgOver3 <- hyperGTest(params3)
hgOver3
Gene to GO BP test for over-representation
9504 GO BP ids tested (1718 have p < 0.05)
Selected gene set size: 1850
Gene universe size: 18098
Annotation package: org.Hs.eg
# Results
resHgOver3<- summary(hgOver3)
dim(resHgOver3)
[1] 1718 7
head(resHgOver3)
GOBPID Pvalue OddsRatio ExpCount Count Size
1 GO:0051179 3.116919e-25 1.702443 540.4437 737 5287
2 GO:0051641 3.196177e-25 1.818155 343.3611 515 3359
3 GO:0070727 4.513427e-25 1.924526 255.9620 410 2504
4 GO:0008104 6.806964e-25 1.921959 254.8376 408 2493
5 GO:0051234 2.127719e-23 1.692510 475.7377 658 4654
6 GO:0045184 2.146591e-22 2.036221 172.9583 297 1692
Term
1 localization
2 cellular localization
3 cellular macromolecule localization
4 protein localization
5 establishment of localization
6 establishment of protein localization
Figure 16: Distribution of p-values of the GO enrichment analysis
We can se in Figure 16 that the P-value distribution is enriched towards low P-values, which often makes this type of analysis too optimistic. A challenge in Gene Ontology (GO) analysis arises from the interdependence of terms due to their hierarchical structure and overlapping gene sets. To address this, Alexa, Rahnenführer, and Lengauer (2006) propose computing the significance of a GO term conditional on the significance of its children. This involves performing a bottom-up conditional test, where genes associated with a significant child term are removed from its parent terms. This approach prioritizes more specific and relevant terms in the analysis.
# ND vs T1D (tt3)
conditional(params3) <- TRUE
hgOverCond3 <- hyperGTest(params3)
hgOverCond3
Gene to GO BP Conditional test for over-representation
9504 GO BP ids tested (973 have p < 0.05)
Selected gene set size: 1850
Gene universe size: 18098
Annotation package: org.Hs.eg
#into a df
resHgOverCond3 <- summary(hgOverCond3)
dim(resHgOverCond3)
[1] 973 7
head(resHgOverCond3)
GOBPID Pvalue OddsRatio ExpCount Count Size
1 GO:0033036 8.637072e-21 1.754928 305.33484 453 2987
2 GO:0032879 6.837364e-14 1.695413 202.30631 302 1985
3 GO:0051641 2.871855e-11 1.784969 121.31310 192 1272
4 GO:0051649 5.145642e-11 1.641630 172.84255 254 1713
5 GO:0071840 9.318193e-10 1.351808 685.18897 805 6703
6 GO:0071692 5.558165e-09 2.262064 38.63963 76 378
Term
1 macromolecule localization
2 regulation of localization
3 cellular localization
4 establishment of localization in cell
5 cellular component organization or biogenesis
6 protein localization to extracellular region
#Filter with a minimum number of count and size
mask <- resHgOverCond3$Size >= 3 & resHgOverCond3$Size <= 300 & + resHgOverCond3$Count >= 3
resHgOverCond3 <- resHgOverCond3[mask, ]
dim(resHgOverCond3)
[1] 702 7
ord <- order(resHgOverCond3$OddsRatio, decreasing=TRUE)
resHgOverCond3 <- resHgOverCond3[ord, ]
#Extratc genes of each GO term
geneIDs <- geneIdsByCategory(hgOverCond3)[resHgOverCond3$GOBPID]
geneSYMs <- sapply(geneIDs, function(id) mapIds(org.Hs.eg.db, keys = id, column = "SYMBOL", keytype = "ENTREZID"))
geneSYMs <- sapply(geneSYMs, paste, collapse=", ")
goresults <- cbind(resHgOverCond3, Genes=geneSYMs)
rownames(goresults) <- 1:nrow(goresults)
head(goresults)
GOBPID Pvalue OddsRatio ExpCount Count Size
1 GO:0002477 1.066574e-03 Inf 0.3066637 3 3
2 GO:0006613 4.931174e-04 35.34241 0.5093233 4 5
3 GO:0002502 4.999107e-04 35.20477 0.5111062 4 5
4 GO:0030070 2.641905e-05 26.43059 0.8177699 6 8
5 GO:0014809 3.939691e-03 26.38928 0.4088850 3 4
6 GO:0019249 3.939691e-03 26.38928 0.4088850 3 4
Term
1 antigen processing and presentation of exogenous peptide antigen via MHC class Ib
2 cotranslational protein targeting to membrane
3 peptide antigen assembly with MHC class I protein complex
4 insulin processing
5 regulation of skeletal muscle contraction by regulation of release of sequestered calcium ion
6 lactate biosynthetic process
Genes
1 HLA-E, HLA-F, TAP2
2 ARL6IP1, SIL1, SSR1, SSR2
3 PDIA3, HLA-A, TAPBPL, TAPBP
4 CPE, SLC30A8, P4HB, PCSK1, ERO1B, YIPF5
5 DMD, GSTM2, GSTO1
6 PARK7, GATD1, PER2
ktab <- kable(goresults, "html", caption="GO results.") %>% kable_styling(bootstrap_options = c("striped", "hover", "responsive"), fixed_thead = TRUE)
fnameHTML_ND_D <- "ND_T1D_go.html"
fpathHTML_ND_D <- file.path(path2pkg, "doc", fnameHTML_ND_D)
save_kable(ktab, file = fpathHTML_ND_D, self_contained = TRUE)
Figure 17: Distribution of p-values of the GO enrichment analysis after conditionaltest
The number of significant GO terms has decreased and the histogram of Figure 17 shows that it is now less biased to low pvalues after doing the conditional test.
For a better representation of the topGO terms we visualize them using a Dotplot.
Figure 18: Dotplot of the top 25 GO enriched terms in T1D vs ND comparison
We can see in 18 several significantly enriched biological processes. Antigen processing and presentation, both endogenous and exogenous, highlight the autoimmune nature of type 1 diabetes, leading to beta-cell destruction. Processes like gluconeogenesis, glucose homeostasis, and hexose metabolism underscore metabolic dysregulation, resulting in hyperglycemia. Protein-related processes, including folding in the ER, localization to membranes, and positive regulation of protein transport and secretion, emphasize the cellular stress and impaired insulin production characteristic of type 1.
In addition to performing functional analysis with Gene Ontology (GO), we employed the Gene Set Enrichment Analysis (GSEA) approach to assess pathway enrichment for each gene set. GSEA calculates an enrichment score (ES) based on the position of genes within a ranked list of changes in gene expression. Significance then, is calculated by generating an empirical null distribution of ES values and either (1) permute phenotype labels; or (2) permute gene set memberships. In our case we will permute gene set memberships. Permuting gene set memberships is a valuable approach for evaluating differences between pairs of conditions in gene set enrichment analysis. This method allows us to identify gene sets whose functions or characteristics are inherently associated with differential expression patterns across conditions, providing insights into the underlying biological processes driving these differences.
GSEA is particularly adept at detecting subtle yet coordinated changes in gene expression, offering insights into the underlying molecular mechanisms implicated in the disease. This approach enables us to elucidate the pathways and biological processes disrupted in diabetes, even in instances where highly significant differentially expressed genes may be lacking in certain conditions.
To conduct the GSEA analysis, we need a collection of gene sets. For this purpose, we will use the GeneSetCollection called c2BroadSets from the GSVA, which includes curated gene sets from various sources such as online pathway databases and biomedical literature. Specifically, we will use the KEGG, REACTOME, and BIOCARTA pathways from this GeneSetCollection, which are well-regarded databases providing curated information about biological pathways and networks.
library(GSVAdata)
# C2 Curated Gene Sets
data(c2BroadSets)
c2BroadSets <- c2BroadSets[c(grep("^KEGG", names(c2BroadSets)),
grep("^REACTOME", names(c2BroadSets)), grep("^BIOCARTA", names(c2BroadSets)))]
gsc<-GeneSetCollection(c2BroadSets)
library(fgsea)
gsets <- geneIds(gsc)
#AAb vs T1D
stats1 <- tt1$t
names(stats1) <- rownames(tt1)
fgseares1 <- fgsea(gsets, stats1, minSize=5, maxSize=300)
#AAb vs ND
stats2 <- tt2$t
names(stats2) <- rownames(tt2)
fgseares2 <- fgsea(gsets, stats2, minSize=5, maxSize=300)
#T1D vs ND
stats3 <- tt3$t
names(stats3) <- rownames(tt3)
fgseares3 <- fgsea(gsets, stats3, minSize=5, maxSize=300)
After the analysis, we filter the resuls based on a FDR threshold of less than 1%.
The enriched pathways with a FDR threshold of less than 1% between AAb and T1D samples are shown in Table 4 below:
| Pathway | PValue | PadjValue | ES | NES | Size | LeadingEdge |
|---|---|---|---|---|---|---|
| KEGG_PROTEIN_EXPORT | 6.90e-06 | 0.0019422 | 0.6959011 | 2.265054 | 23 | 90701, 5…. |
| REACTOME_DOWNSTREAM_EVENTS_IN_GPCR_SIGNALING | 7.20e-06 | 0.0019422 | -0.3800618 | -1.797856 | 180 | 2786, 57…. |
| REACTOME_TRNA_AMINOACYLATION | 5.00e-06 | 0.0019422 | 0.5975938 | 2.225263 | 39 | 57176, 6…. |
| REACTOME_GLUCOSE_REGULATION_OF_INSULIN_SECRETION | 1.14e-05 | 0.0022942 | 0.3756189 | 1.839260 | 150 | 5223, 21…. |
| REACTOME_MITOTIC_M_M_G1_PHASES | 2.32e-05 | 0.0037502 | 0.3935004 | 1.851023 | 121 | 5714, 63…. |
| KEGG_PARKINSONS_DISEASE | 2.80e-05 | 0.0037630 | 0.4040849 | 1.889321 | 114 | 2861, 29…. |
| KEGG_MELANOGENESIS | 4.87e-05 | 0.0056130 | -0.5057011 | -1.987588 | 57 | 7482, 19…. |
Now we represent the enriched pathways between AAb and T1D in a dot plot (Figure 19).
Figure 19: Dotplot of the top 25 GSEA enriched pathways between AAb vs T1D
The enriched pathways with a FDR threshold of less than 1% between AAb and ND samples are shown in Table 5 below:
| Pathway | PValue | PadjValue | ES | NES | Size | LeadingEdge |
|---|---|---|---|---|---|---|
| KEGG_RIBOSOME | 0.0000000 | 0.0000000 | 0.8064631 | 3.452906 | 79 | 6206, 62…. |
| REACTOME_PEPTIDE_CHAIN_ELONGATION | 0.0000000 | 0.0000000 | 0.8085119 | 3.453679 | 77 | 6206, 62…. |
| REACTOME_FORMATION_OF_A_POOL_OF_FREE_40S_SUBUNITS | 0.0000000 | 0.0000000 | 0.7720358 | 3.355927 | 87 | 6206, 62…. |
| REACTOME_INFLUENZA_VIRAL_RNA_TRANSCRIPTION_AND_REPLICATION | 0.0000000 | 0.0000000 | 0.7690621 | 3.392994 | 91 | 6206, 38…. |
| REACTOME_REGULATION_OF_GENE_EXPRESSION_IN_BETA_CELLS | 0.0000000 | 0.0000000 | 0.7651946 | 3.375931 | 91 | 6206, 62…. |
| REACTOME_TRANSLATION | 0.0000000 | 0.0000000 | 0.7263010 | 3.326538 | 110 | 6206, 62…. |
| REACTOME_GTP_HYDROLYSIS_AND_JOINING_OF_THE_60S_RIBOSOMAL_SUBUNIT | 0.0000000 | 0.0000000 | 0.7455611 | 3.327478 | 97 | 6206, 62…. |
| REACTOME_VIRAL_MRNA_TRANSLATION | 0.0000000 | 0.0000000 | 0.7956518 | 3.398745 | 77 | 6206, 62…. |
| REACTOME_REGULATION_OF_BETA_CELL_DEVELOPMENT | 0.0000000 | 0.0000000 | 0.7346015 | 3.303028 | 100 | 6206, 62…. |
| REACTOME_INSULIN_SYNTHESIS_AND_SECRETION | 0.0000000 | 0.0000000 | 0.6812328 | 3.147517 | 122 | 6206, 62…. |
| REACTOME_INFLUENZA_LIFE_CYCLE | 0.0000000 | 0.0000000 | 0.6296879 | 2.920051 | 128 | 6206, 38…. |
| REACTOME_METABOLISM_OF_PROTEINS | 0.0000000 | 0.0000000 | 0.5606133 | 2.764842 | 193 | 6206, 78…. |
| KEGG_OXIDATIVE_PHOSPHORYLATION | 0.0000000 | 0.0000000 | 0.6286930 | 2.889053 | 112 | 4537, 47…. |
| REACTOME_SIGNALING_IN_IMMUNE_SYSTEM | 0.0000000 | 0.0000000 | -0.5097911 | -2.491415 | 203 | 3122, 71…. |
| REACTOME_GLUCOSE_REGULATION_OF_INSULIN_SECRETION | 0.0000000 | 0.0000000 | 0.5481320 | 2.630998 | 150 | 4191, 45…. |
| REACTOME_ELECTRON_TRANSPORT_CHAIN | 0.0000000 | 0.0000000 | 0.6769859 | 2.840892 | 71 | 4537, 47…. |
| KEGG_PARKINSONS_DISEASE | 0.0000000 | 0.0000000 | 0.5707364 | 2.621775 | 114 | 4537, 47…. |
| REACTOME_INTEGRATION_OF_ENERGY_METABOLISM | 0.0000000 | 0.0000000 | 0.4692747 | 2.325461 | 206 | 2786, 41…. |
| REACTOME_FORMATION_OF_THE_TERNARY_COMPLEX_AND_SUBSEQUENTLY_THE_43S_COMPLEX | 0.0000000 | 0.0000000 | 0.7466731 | 2.802486 | 44 | 6206, 62…. |
| REACTOME_TRANSLATION_INITIATION_COMPLEX_FORMATION | 0.0000000 | 0.0000000 | 0.7128553 | 2.751623 | 50 | 6206, 62…. |
| REACTOME_REGULATION_OF_INSULIN_SECRETION | 0.0000000 | 0.0000000 | 0.4831362 | 2.373890 | 185 | 2786, 41…. |
| KEGG_ANTIGEN_PROCESSING_AND_PRESENTATION | 0.0000000 | 0.0000000 | -0.7234985 | -2.740426 | 45 | 972, 312…. |
| KEGG_ALLOGRAFT_REJECTION | 0.0000000 | 0.0000000 | -0.9139404 | -2.688664 | 16 | 3122, 31…. |
| KEGG_AUTOIMMUNE_THYROID_DISEASE | 0.0000000 | 0.0000000 | -0.9139404 | -2.688664 | 16 | 3122, 31…. |
| KEGG_GRAFT_VERSUS_HOST_DISEASE | 0.0000000 | 0.0000000 | -0.9114850 | -2.630330 | 15 | 3122, 31…. |
| KEGG_VIRAL_MYOCARDITIS | 0.0000000 | 0.0000001 | -0.6838371 | -2.578884 | 44 | 3122, 31…. |
| KEGG_HUNTINGTONS_DISEASE | 0.0000000 | 0.0000001 | 0.4613188 | 2.213853 | 151 | 4707, 29…. |
| REACTOME_PD1_SIGNALING | 0.0000000 | 0.0000002 | -0.8915716 | -2.529687 | 14 | 3122, 31…. |
| KEGG_TYPE_I_DIABETES_MELLITUS | 0.0000000 | 0.0000004 | -0.8028364 | -2.550373 | 22 | 3122, 31…. |
| REACTOME_SYNTHESIS_OF_DNA | 0.0000001 | 0.0000017 | 0.5370809 | 2.297133 | 78 | 6119, 23…. |
| REACTOME_COSTIMULATION_BY_THE_CD28_FAMILY | 0.0000001 | 0.0000019 | -0.6420463 | -2.439233 | 46 | 3122, 31…. |
| KEGG_SYSTEMIC_LUPUS_ERYTHEMATOSUS | 0.0000001 | 0.0000031 | -0.6325766 | -2.420948 | 47 | 3122, 71…. |
| REACTOME_PHOSPHORYLATION_OF_CD3_AND_TCR_ZETA_CHAINS | 0.0000002 | 0.0000044 | -0.8709451 | -2.445945 | 13 | 3122, 31…. |
| KEGG_CELL_ADHESION_MOLECULES_CAMS | 0.0000002 | 0.0000057 | -0.5362391 | -2.228927 | 76 | 3122, 31…. |
| REACTOME_S_PHASE | 0.0000003 | 0.0000063 | 0.5026120 | 2.196143 | 89 | 6119, 23…. |
| REACTOME_TCR_SIGNALING | 0.0000006 | 0.0000130 | -0.6389978 | -2.359680 | 41 | 3122, 31…. |
| KEGG_ASTHMA | 0.0000009 | 0.0000201 | -0.8654647 | -2.367582 | 12 | 3122, 31…. |
| KEGG_INTESTINAL_IMMUNE_NETWORK_FOR_IGA_PRODUCTION | 0.0000009 | 0.0000201 | -0.7592434 | -2.382771 | 21 | 3122, 31…. |
| REACTOME_G1_S_TRANSITION | 0.0000016 | 0.0000329 | 0.4864347 | 2.085497 | 83 | 6119, 23…. |
| REACTOME_DOWNSTREAM_TCR_SIGNALING | 0.0000019 | 0.0000373 | -0.6926346 | -2.371359 | 29 | 3122, 31…. |
| REACTOME_TRANSLOCATION_OF_ZAP70_TO_IMMUNOLOGICAL_SYNAPSE | 0.0000020 | 0.0000388 | -0.8934702 | -2.341962 | 10 | 3122, 31…. |
| KEGG_ALZHEIMERS_DISEASE | 0.0000022 | 0.0000428 | 0.4175936 | 1.956413 | 136 | 7132, 47…. |
| REACTOME_GENERATION_OF_SECOND_MESSENGER_MOLECULES | 0.0000028 | 0.0000529 | -0.7700026 | -2.368328 | 19 | 3122, 31…. |
| REACTOME_DNA_REPLICATION_PRE_INITIATION | 0.0000053 | 0.0000981 | 0.5063248 | 2.093552 | 67 | 6119, 23…. |
| KEGG_LEISHMANIA_INFECTION | 0.0000063 | 0.0001125 | -0.5831280 | -2.208739 | 45 | 3122, 31…. |
| REACTOME_ACTIVATION_OF_ATR_IN_RESPONSE_TO_REPLICATION_STRESS | 0.0000098 | 0.0001712 | 0.6709400 | 2.259362 | 27 | 6119, 23…. |
| REACTOME_G2_M_CHECKPOINTS | 0.0000173 | 0.0002979 | 0.6221570 | 2.167031 | 31 | 6119, 23…. |
| KEGG_COMPLEMENT_AND_COAGULATION_CASCADES | 0.0000184 | 0.0003096 | -0.5810349 | -2.180938 | 43 | 713, 712…. |
| REACTOME_IMMUNOREGULATORY_INTERACTIONS_BETWEEN_A_LYMPHOID_AND_A_NON_LYMPHOID_CELL | 0.0000307 | 0.0005059 | -0.6799889 | -2.187130 | 24 | 2214, 31…. |
| BIOCARTA_BLYMPHOCYTE_PATHWAY | 0.0000318 | 0.0005139 | -0.9326611 | -2.031101 | 6 | 3122, 31…. |
| REACTOME_ORC1_REMOVAL_FROM_CHROMATIN | 0.0000335 | 0.0005299 | 0.5194842 | 2.081748 | 57 | 23595, 1…. |
| REACTOME_FORMATION_OF_ATP_BY_CHEMIOSMOTIC_COUPLING | 0.0000372 | 0.0005776 | 0.7491271 | 2.208718 | 15 | 9551, 45…. |
| REACTOME_CELL_CYCLE_CHECKPOINTS | 0.0000448 | 0.0006819 | 0.4353824 | 1.932005 | 93 | 6119, 23…. |
| BIOCARTA_COMP_PATHWAY | 0.0000603 | 0.0008855 | -0.8456689 | -2.216665 | 10 | 713, 712…. |
| REACTOME_COMPLEMENT_CASCADE | 0.0000603 | 0.0008855 | -0.8456689 | -2.216665 | 10 | 713, 712…. |
| REACTOME_INITIAL_TRIGGERING_OF_COMPLEMENT | 0.0000664 | 0.0009567 | -0.8953598 | -2.087885 | 7 | 713, 712…. |
| KEGG_DNA_REPLICATION | 0.0000795 | 0.0011258 | 0.5941479 | 2.089992 | 33 | 6119, 59…. |
| BIOCARTA_TCRA_PATHWAY | 0.0001105 | 0.0015136 | -0.9348933 | -1.933580 | 5 | 3122, 31…. |
| REACTOME_CELL_CYCLE_MITOTIC | 0.0001107 | 0.0015136 | 0.3141142 | 1.606072 | 244 | 9183, 61…. |
| BIOCARTA_CLASSIC_PATHWAY | 0.0001153 | 0.0015507 | -0.8483118 | -2.142589 | 9 | 713, 712…. |
| REACTOME_HEMOSTASIS | 0.0001216 | 0.0016090 | -0.3560318 | -1.716927 | 187 | 3708, 33…. |
| REACTOME_CLASSICAL_ANTIBODY_MEDIATED_COMPLEMENT_ACTIVATION | 0.0001409 | 0.0018336 | -0.9324302 | -1.928486 | 5 | 713, 712…. |
| REACTOME_CYCLIN_E_ASSOCIATED_EVENTS_DURING_G1_S_TRANSITION_ | 0.0001595 | 0.0020426 | 0.4967815 | 1.948640 | 53 | 1017, 59…. |
| REACTOME_M_G1_TRANSITION | 0.0002867 | 0.0036145 | 0.4890226 | 1.940914 | 55 | 23595, 5…. |
| BIOCARTA_TH1TH2_PATHWAY | 0.0003363 | 0.0041755 | -0.8957833 | -1.950790 | 6 | 3122, 31…. |
| REACTOME_VIF_MEDIATED_DEGRADATION_OF_APOBEC3G | 0.0003630 | 0.0044389 | 0.5156384 | 1.959590 | 47 | 5705, 56…. |
| REACTOME_PECAM1_INTERACTIONS | 0.0004006 | 0.0048255 | -0.8237554 | -2.080567 | 9 | 4067, 25…. |
| KEGG_NATURAL_KILLER_CELL_MEDIATED_CYTOTOXICITY | 0.0004255 | 0.0050501 | -0.4585651 | -1.852614 | 62 | 2214, 31…. |
| REACTOME_REPAIR_SYNTHESIS_OF_PATCH_27_30_BASES_LONG_BY_DNA_POLYMERASE | 0.0005044 | 0.0058988 | 0.7119880 | 2.028239 | 14 | 6119, 59…. |
| REACTOME_SCF_SKP2_MEDIATED_DEGRADATION_OF_P27_P21 | 0.0005245 | 0.0060465 | 0.4930899 | 1.884001 | 48 | 1017, 65…. |
| REACTOME_PHASE_1_FUNCTIONALIZATION | 0.0006601 | 0.0075028 | -0.9040343 | -1.869757 | 5 | 4129, 69…. |
| REACTOME_INTEGRIN_CELL_SURFACE_INTERACTIONS | 0.0007459 | 0.0083600 | -0.4604104 | -1.846990 | 61 | 3680, 12…. |
| REACTOME_MRNA_SPLICING_MINOR_PATHWAY | 0.0009015 | 0.0099654 | 0.5125211 | 1.886393 | 40 | 6637, 10…. |
Now we represent the top 25 enriched pathways between AAb and ND in a dot plot (Figure 20).
Figure 20: Dotplot of the top 25 GSEA enriched pathways between AAb and ND
The enriched pathways with a FDR threshold of less than 1% between T1D and ND samples are shown in Table 6 below:
| Pathway | PValue | PadjValue | ES | NES | Size | LeadingEdge |
|---|---|---|---|---|---|---|
| KEGG_OXIDATIVE_PHOSPHORYLATION | 0.0000000 | 0.0000000 | -0.6137227 | -2.703860 | 112 | 51382, 5…. |
| REACTOME_GLUCOSE_REGULATION_OF_INSULIN_SECRETION | 0.0000000 | 0.0000000 | -0.5591489 | -2.574392 | 150 | 6514, 41…. |
| REACTOME_REGULATION_OF_INSULIN_SECRETION | 0.0000000 | 0.0000000 | -0.5278999 | -2.498343 | 185 | 6514, 41…. |
| KEGG_PARKINSONS_DISEASE | 0.0000000 | 0.0000000 | -0.5779166 | -2.561744 | 114 | 7345, 46…. |
| REACTOME_INTEGRATION_OF_ENERGY_METABOLISM | 0.0000000 | 0.0000000 | -0.4857029 | -2.317466 | 206 | 6514, 26…. |
| REACTOME_SIGNALING_IN_IMMUNE_SYSTEM | 0.0000000 | 0.0000000 | 0.4473025 | 2.320561 | 203 | 3106, 31…. |
| KEGG_ALLOGRAFT_REJECTION | 0.0000000 | 0.0000000 | 0.9043587 | 2.719197 | 16 | 3106, 31…. |
| KEGG_AUTOIMMUNE_THYROID_DISEASE | 0.0000000 | 0.0000000 | 0.9043587 | 2.719197 | 16 | 3106, 31…. |
| KEGG_GRAFT_VERSUS_HOST_DISEASE | 0.0000000 | 0.0000000 | 0.9004719 | 2.677130 | 15 | 3106, 31…. |
| KEGG_VIRAL_MYOCARDITIS | 0.0000000 | 0.0000001 | 0.6704811 | 2.623891 | 44 | 3106, 31…. |
| REACTOME_CLASSICAL_ANTIBODY_MEDIATED_COMPLEMENT_ACTIVATION | 0.0000000 | 0.0000002 | 0.9920361 | 2.062466 | 5 | 713, 716…. |
| KEGG_ANTIGEN_PROCESSING_AND_PRESENTATION | 0.0000000 | 0.0000002 | 0.6524344 | 2.569837 | 45 | 972, 310…. |
| REACTOME_ELECTRON_TRANSPORT_CHAIN | 0.0000000 | 0.0000004 | -0.5868998 | -2.375017 | 71 | 4698, 54…. |
| KEGG_LEISHMANIA_INFECTION | 0.0000000 | 0.0000005 | 0.6404309 | 2.522557 | 45 | 3122, 31…. |
| KEGG_INTESTINAL_IMMUNE_NETWORK_FOR_IGA_PRODUCTION | 0.0000000 | 0.0000005 | 0.7889566 | 2.584008 | 21 | 3122, 31…. |
| REACTOME_IMMUNOREGULATORY_INTERACTIONS_BETWEEN_A_LYMPHOID_AND_A_NON_LYMPHOID_CELL | 0.0000000 | 0.0000008 | 0.7606774 | 2.555285 | 24 | 3106, 31…. |
| REACTOME_INSULIN_SYNTHESIS_AND_SECRETION | 0.0000000 | 0.0000008 | -0.4946990 | -2.223067 | 122 | 90701, 6…. |
| KEGG_CELL_ADHESION_MOLECULES_CAMS | 0.0000000 | 0.0000015 | 0.5402633 | 2.374274 | 76 | 3106, 31…. |
| KEGG_NATURAL_KILLER_CELL_MEDIATED_CYTOTOXICITY | 0.0000000 | 0.0000015 | 0.5689888 | 2.403614 | 62 | 3106, 31…. |
| KEGG_CYTOKINE_CYTOKINE_RECEPTOR_INTERACTION | 0.0000000 | 0.0000017 | 0.5101788 | 2.292957 | 87 | 1436, 14…. |
| REACTOME_MITOTIC_M_M_G1_PHASES | 0.0000002 | 0.0000093 | -0.4692216 | -2.098299 | 121 | 5684, 63…. |
| KEGG_HUNTINGTONS_DISEASE | 0.0000003 | 0.0000096 | -0.4418634 | -2.040986 | 151 | 5332, 46…. |
| KEGG_ALZHEIMERS_DISEASE | 0.0000004 | 0.0000127 | -0.4519475 | -2.052732 | 136 | 5332, 46…. |
| REACTOME_CELL_CYCLE_MITOTIC | 0.0000005 | 0.0000154 | -0.3793156 | -1.862349 | 244 | 7846, 61…. |
| KEGG_PROTEIN_EXPORT | 0.0000010 | 0.0000307 | -0.7323547 | -2.292166 | 23 | 90701, 6…. |
| REACTOME_CELL_CYCLE_CHECKPOINTS | 0.0000012 | 0.0000380 | -0.4761054 | -2.039567 | 93 | 6119, 56…. |
| REACTOME_G1_S_TRANSITION | 0.0000016 | 0.0000476 | -0.5022694 | -2.104512 | 83 | 6119, 56…. |
| KEGG_ERBB_SIGNALING_PATHWAY | 0.0000024 | 0.0000686 | 0.5021515 | 2.168703 | 71 | 2064, 20…. |
| REACTOME_DNA_REPLICATION_PRE_INITIATION | 0.0000025 | 0.0000686 | -0.5234875 | -2.083802 | 67 | 6119, 56…. |
| REACTOME_TCR_SIGNALING | 0.0000031 | 0.0000845 | 0.5915814 | 2.268108 | 41 | 3122, 31…. |
| REACTOME_CDC20_PHOSPHO_APC_MEDIATED_DEGRADATION_OF_CYCLIN_A | 0.0000038 | 0.0000995 | -0.5513104 | -2.135964 | 58 | 5684, 57…. |
| REACTOME_REGULATION_OF_APC_ACTIVATORS_BETWEEN_G1_S_AND_EARLY_ANAPHASE | 0.0000043 | 0.0001080 | -0.5195882 | -2.057018 | 65 | 5684, 57…. |
| KEGG_N_GLYCAN_BIOSYNTHESIS | 0.0000055 | 0.0001333 | -0.5924605 | -2.148438 | 42 | 1603, 85…. |
| BIOCARTA_ASBCELL_PATHWAY | 0.0000062 | 0.0001454 | 0.9606235 | 1.997158 | 5 | 3122, 31…. |
| REACTOME_SYNTHESIS_OF_DNA | 0.0000063 | 0.0001454 | -0.5010712 | -2.088059 | 78 | 6119, 56…. |
| REACTOME_S_PHASE | 0.0000077 | 0.0001719 | -0.4774796 | -2.022385 | 89 | 6119, 56…. |
| KEGG_PATHWAYS_IN_CANCER | 0.0000082 | 0.0001740 | 0.3294146 | 1.728251 | 224 | 27148, 1…. |
| REACTOME_REGULATION_OF_ORNITHINE_DECARBOXYLASE | 0.0000080 | 0.0001740 | -0.5771810 | -2.118154 | 46 | 5684, 57…. |
| KEGG_SYSTEMIC_LUPUS_ERYTHEMATOSUS | 0.0000096 | 0.0001994 | 0.5496324 | 2.174887 | 47 | 3122, 31…. |
| REACTOME_P53_INDEPENDENT_DNA_DAMAGE_RESPONSE | 0.0000116 | 0.0002331 | -0.5806178 | -2.105493 | 42 | 5684, 57…. |
| KEGG_BASAL_CELL_CARCINOMA | 0.0000139 | 0.0002541 | 0.6686341 | 2.205788 | 22 | 27148, 2…. |
| REACTOME_CDT1_ASSOCIATION_WITH_THE_CDC6_ORC_ORIGIN_COMPLEX | 0.0000137 | 0.0002541 | -0.5636963 | -2.100732 | 48 | 5684, 57…. |
| REACTOME_GENERATION_OF_SECOND_MESSENGER_MOLECULES | 0.0000130 | 0.0002541 | 0.7356972 | 2.335570 | 19 | 3122, 31…. |
| REACTOME_PHOSPHORYLATION_OF_CD3_AND_TCR_ZETA_CHAINS | 0.0000136 | 0.0002541 | 0.8007130 | 2.211635 | 13 | 3122, 31…. |
| REACTOME_STABILIZATION_OF_P53 | 0.0000149 | 0.0002619 | -0.5655961 | -2.075640 | 46 | 5684, 57…. |
| REACTOME_VIF_MEDIATED_DEGRADATION_OF_APOBEC3G | 0.0000147 | 0.0002619 | -0.5651471 | -2.094956 | 47 | 5684, 57…. |
| REACTOME_AUTODEGRADATION_OF_CDH1_BY_CDH1_APC | 0.0000157 | 0.0002701 | -0.5397576 | -2.069112 | 56 | 5684, 57…. |
| BIOCARTA_CLASSIC_PATHWAY | 0.0000194 | 0.0003193 | 0.8685597 | 2.191558 | 9 | 713, 716…. |
| REACTOME_SCF_BETA_TRCP_MEDIATED_DEGRADATION_OF_EMI1 | 0.0000192 | 0.0003193 | -0.5614160 | -2.081125 | 47 | 5684, 57…. |
| KEGG_ASTHMA | 0.0000206 | 0.0003330 | 0.8070557 | 2.213283 | 12 | 3122, 31…. |
| REACTOME_CYCLIN_E_ASSOCIATED_EVENTS_DURING_G1_S_TRANSITION_ | 0.0000226 | 0.0003398 | -0.5361763 | -2.053075 | 53 | 5684, 57…. |
| REACTOME_M_G1_TRANSITION | 0.0000226 | 0.0003398 | -0.5410323 | -2.080309 | 55 | 5684, 57…. |
| REACTOME_TRANSLOCATION_OF_ZAP70_TO_IMMUNOLOGICAL_SYNAPSE | 0.0000227 | 0.0003398 | 0.8508645 | 2.215076 | 10 | 3122, 31…. |
| REACTOME_TRNA_AMINOACYLATION | 0.0000220 | 0.0003398 | -0.5934683 | -2.122662 | 39 | 6897, 26…. |
| BIOCARTA_PROTEASOME_PATHWAY | 0.0000241 | 0.0003537 | -0.7386208 | -2.210451 | 19 | 5684, 61…. |
| REACTOME_METABOLISM_OF_AMINO_ACIDS | 0.0000255 | 0.0003670 | -0.4175072 | -1.880223 | 127 | 384, 494…. |
| REACTOME_INITIAL_TRIGGERING_OF_COMPLEMENT | 0.0000306 | 0.0004329 | 0.9058373 | 2.084542 | 7 | 713, 716…. |
| BIOCARTA_BLYMPHOCYTE_PATHWAY | 0.0000345 | 0.0004591 | 0.9180079 | 1.993582 | 6 | 3122, 36…. |
| KEGG_ENDOCYTOSIS | 0.0000340 | 0.0004591 | 0.3573656 | 1.779197 | 154 | 3106, 31…. |
| KEGG_TYPE_I_DIABETES_MELLITUS | 0.0000338 | 0.0004591 | 0.6504828 | 2.145908 | 22 | 3106, 31…. |
| REACTOME_SCF_SKP2_MEDIATED_DEGRADATION_OF_P27_P21 | 0.0000347 | 0.0004591 | -0.5458952 | -2.034392 | 48 | 5684, 57…. |
| REACTOME_METABOLISM_OF_PROTEINS | 0.0000461 | 0.0006005 | -0.3684086 | -1.756189 | 193 | 6903, 78…. |
| REACTOME_SIGNALING_BY_WNT | 0.0000469 | 0.0006009 | -0.5155486 | -1.997411 | 58 | 5684, 57…. |
| BIOCARTA_INFLAM_PATHWAY | 0.0000682 | 0.0008594 | 0.8322496 | 2.166616 | 10 | 3122, 31…. |
| BIOCARTA_MAPK_PATHWAY | 0.0000727 | 0.0009032 | 0.4244014 | 1.873016 | 79 | 4214, 13…. |
| KEGG_PROTEASOME | 0.0000773 | 0.0009453 | -0.5651386 | -2.044756 | 41 | 5684, 57…. |
| BIOCARTA_COMP_PATHWAY | 0.0000812 | 0.0009636 | 0.8287682 | 2.157552 | 10 | 713, 716…. |
| REACTOME_COMPLEMENT_CASCADE | 0.0000812 | 0.0009636 | 0.8287682 | 2.157552 | 10 | 713, 716…. |
| REACTOME_INNATE_IMMUNITY_SIGNALING | 0.0000908 | 0.0010622 | 0.4182879 | 1.861669 | 82 | 713, 716…. |
| REACTOME_CYTOSOLIC_TRNA_AMINOACYLATION | 0.0001079 | 0.0012437 | -0.6527627 | -2.043055 | 23 | 6897, 26…. |
| REACTOME_ORC1_REMOVAL_FROM_CHROMATIN | 0.0001119 | 0.0012547 | -0.5015174 | -1.937437 | 57 | 5684, 57…. |
| REACTOME_PD1_SIGNALING | 0.0001106 | 0.0012547 | 0.7451292 | 2.148040 | 14 | 3122, 31…. |
| KEGG_MAPK_SIGNALING_PATHWAY | 0.0001343 | 0.0014848 | 0.3253598 | 1.665817 | 187 | 2263, 42…. |
| KEGG_AMINOACYL_TRNA_BIOSYNTHESIS | 0.0001496 | 0.0016311 | -0.5491910 | -1.967733 | 40 | 6897, 26…. |
| REACTOME_METABOLISM_OF_CARBOHYDRATES | 0.0001620 | 0.0017426 | -0.4262994 | -1.837159 | 95 | 6514, 41…. |
| KEGG_COMPLEMENT_AND_COAGULATION_CASCADES | 0.0001810 | 0.0019224 | 0.5141089 | 1.994053 | 43 | 710, 713…. |
| BIOCARTA_CD40_PATHWAY | 0.0001990 | 0.0020750 | 0.7449371 | 2.057577 | 13 | 958, 421…. |
| REACTOME_HOST_INTERACTIONS_OF_HIV_FACTORS | 0.0002006 | 0.0020750 | -0.4025476 | -1.783636 | 113 | 9844, 56…. |
| REACTOME_FACILITATIVE_NA_INDEPENDENT_GLUCOSE_TRANSPORTERS | 0.0002354 | 0.0024051 | -0.8450700 | -1.980798 | 8 | 6514, 15…. |
| REACTOME_DOWNSTREAM_TCR_SIGNALING | 0.0002582 | 0.0026047 | 0.5765914 | 2.009129 | 29 | 3122, 31…. |
| REACTOME_GLUCONEOGENESIS | 0.0002906 | 0.0028954 | -0.5999420 | -1.933853 | 26 | 4191, 52…. |
| REACTOME_GLUCOSE_METABOLISM | 0.0003022 | 0.0029742 | -0.5151851 | -1.909751 | 47 | 4191, 52…. |
| REACTOME_RHO_GTPASE_CYCLE | 0.0003076 | 0.0029906 | 0.3747026 | 1.720242 | 96 | 4952, 36…. |
| REACTOME_AXON_GUIDANCE | 0.0003398 | 0.0032643 | 0.3495046 | 1.682180 | 128 | 54209, 5…. |
| BIOCARTA_LAIR_PATHWAY | 0.0004094 | 0.0038867 | 0.7947663 | 2.069035 | 10 | 3689, 71…. |
| REACTOME_HIV_INFECTION | 0.0004181 | 0.0039232 | -0.3520578 | -1.653085 | 172 | 9844, 56…. |
| REACTOME_ACTIVATION_OF_CHAPERONES_BY_IRE1_ALPHA | 0.0004449 | 0.0041270 | -0.8005015 | -1.996247 | 10 | 10130, 4…. |
| BIOCARTA_TCRA_PATHWAY | 0.0005032 | 0.0046150 | 0.8973349 | 1.865580 | 5 | 3122, 31…. |
| REACTOME_TELOMERE_MAINTENANCE | 0.0005958 | 0.0054027 | -0.5079837 | -1.842100 | 42 | 3015, 61…. |
| REACTOME_FORMATION_OF_ATP_BY_CHEMIOSMOTIC_COUPLING | 0.0006193 | 0.0055531 | -0.7067907 | -1.950160 | 15 | 515, 516…. |
| KEGG_VIBRIO_CHOLERAE_INFECTION | 0.0006399 | 0.0056744 | -0.5107938 | -1.859700 | 43 | 51382, 5…. |
| BIOCARTA_TNFR2_PATHWAY | 0.0006677 | 0.0058565 | 0.6805632 | 2.023335 | 15 | 4214, 71…. |
| BIOCARTA_EPO_PATHWAY | 0.0006837 | 0.0059330 | 0.6677861 | 2.102247 | 18 | 3717, 20…. |
| REACTOME_GLYCOLYSIS | 0.0006983 | 0.0059947 | -0.6696323 | -1.947766 | 17 | 5208, 52…. |
| BIOCARTA_IL3_PATHWAY | 0.0007130 | 0.0060572 | 0.7161174 | 1.977975 | 13 | 3717, 66…. |
| KEGG_HEMATOPOIETIC_CELL_LINEAGE | 0.0007761 | 0.0065240 | 0.5384894 | 1.940742 | 31 | 3122, 14…. |
| REACTOME_MITOTIC_PROMETAPHASE | 0.0008870 | 0.0073793 | -0.4447900 | -1.755046 | 64 | 6396, 55…. |
| REACTOME_PREFOLDIN_MEDIATED_TRANSFER_OF_SUBSTRATE_TO_CCT_TRIC | 0.0009741 | 0.0080215 | -0.6078353 | -1.885981 | 22 | 7846, 74…. |
| KEGG_CITRATE_CYCLE_TCA_CYCLE | 0.0011144 | 0.0090840 | -0.5486226 | -1.826113 | 29 | 4191, 41…. |
| BIOCARTA_GLEEVEC_PATHWAY | 0.0012134 | 0.0097923 | 0.5666274 | 1.869273 | 22 | 4214, 37…. |
Finally we represent the top 25 enriched pathways between T1D and ND in a dot plot (Figure 21).
Figure 21: Dotplot of the top 25 GSEA enriched pathways between T1D and ND
To address the main question of our project, which involved understanding potential differential expression patterns between the studied groups, a methodical and thorough analysis was carried out. In this context, this work revealed no differentially expressed genes between patients suffering from diabetes (T1D) and auto-antibody individuals (AAb), suggesting that the two conditions are similar. On the other hand, only two DE genes were found between AAb and control (ND), in comparison with the six total genes found in the reference study. In our analysis, one gene is upregulated (LINC01099) and one downregulated (CD74). Interestingly, LINC01099 is a non-coding RNA which was not detected in Campbell-Thompson et al. (2021) work. However, previous studies have reported its presence and it was considered to be expressed mainly in β-pancreatic cells (Segerstolpe et al. (2016)). In the case of autoimmune diseases related genes, CD74 stands out. It is also known for being part of the class II major histocompatibility complex (MHC). Our analysis indicated that CD74 expression was significantly higher in both AAb and T1D donors compared to ND donors, which is also consistent with the findings in the reference paper. Furthermore, this gene was previously reported to have increased expression in β-cells coming from T1D patients. This study claims to be the first one detecting increased expression of this gene specifically in AAb donors, which is supported as well by our results.
In the case of T1D and ND, more than 2000 genes were found to be differentially expressed, in comparison with the 89 DE genes shown in the paper. Between the top genes we found several autoimmune disease-related genes. As expected, genes implicated in endrocine system development, regulation and insulin secretion were also significant, giving raise to an enrichment in pathways of the same characteristics. Other genes were found, such as MT1G, which translates into a metallothionein 1 (MT1) protein that binds a variety of metals. MT1G has been previously reported in β-cell biology as a negative regulator of glucose-stimulated insulin secretion (Bensellam, Laybutt, and Jonas (2021)).
Furthermore, a relation between neurological-related pathways and type-1 diabetes is highlighted. In this context, some DE genes found in our analyses include NEURL3, LRRTM3 and NPTX2. Interestingly, the latter, NPTX2 (Neuronal Pentraxin 2 gene, down-regulated), becomes progressively decreased with age (Soares Bispo Santos Silva et al. (2015)). Decreased levels of this modulator are associated with synapse loss and therefore, cognitive decline. It is essential for synaptic plasticity and inhibitory-excitatory balance in the central nervous system (Steenoven et al. (2020)). In this context, NPTX2 levels are also reduced in diabetic β-cells, as supported by previous findings, which is also shown in our report. In the same tone, several studies have reported strong relationship between diabetes disease and Alzheimer’s disease (which is, in fact, considered a comorbid condition with diabetes) (Alamro et al. (2023)), as well as Parkinson’s disease. Indeed, two of the KEGG enriched pathways found significant in our dataset are related to these two neuropathologies (KEGG_ALZHEIMERS_DISEASE, KEGG_PARKINSONS_DISEASE).
However, we were not able to replicate all the enriched pathways related to neurological or synaptic processes such as Dopamine Receptor Signaling or α-Adrenergic Signaling highlighted in the article, which may be due to the fact that different approaches were chosen for the enrichment analysis. Nevertheless, neurological processes are undoubtedly affected in this disease.
Some of the differences seen when obtaining the results in comparison with the reference paper could be due to the fact that authors did not filter lowly expressed genes, and apparently did not use covariates as well. In this sense, our analysis involved examining how samples group using multidimensional scaling (MDS) and hierarchical clustering, annotated by disease state, sex, BMI, age, and cause of death. In the MDS analysis, a prominent clustering pattern with a main cluster and three outliers, including sample T1D_6109 (a 5-year-old) was observed. Moreover, no clear clustering based on the variables examined was identified. The distinct outlier suggests significant deviations in expression profiles, warranting further investigation into potential age-related expression patterns or technical artifacts.
According to the Hierarchical Clustering Analysis, no distinct clustering pattern emerged when using sex, indicating no segregation based on gender. In the case of BMI, some clustering was observed, particularly with a concentration of underweight samples, but other BMI categories did not show clear clusters, suggesting additional contributing factors.
Interestingly, a discernible age-based clustering pattern was found, indicating age might be a confounding variable influencing expression profiles. That is why we suggest careful consideration to avoid bias, though further validation is needed to confirm the potential effect age may have in this disease.
-The analysis revealed no differentially expressed genes between diabetes (T1D) and auto-antibody (AAb) individuals, suggesting that these conditions share similar expression profiles.
-The present findings claim the existence of diverse biological functions affected in diabetes disease.
-Over 2000 genes were differentially expressed between T1D and ND, significantly more than the 89 genes reported in the reference study. Many of these genes are related to autoimmune diseases and endocrine system functions.
-The analysis highlighted the involvement of some neurological pathways in T1D. Enriched pathways included those related to Alzheimer’s and Parkinson’s diseases, underscoring the comorbidity between diabetes and these neurological conditions.
-Differences in results compared to the reference study might stem from filtering lowly expressed genes and incorporating covariates. MDS and hierarchical clustering analyses showed no clear segregation by sex or BMI but revealed an age-based clustering pattern. Age might act as a confounding variable influencing expression profiles, necessitating careful consideration to avoid bias in future analyses.
sessionInfo()
R version 4.4.0 (2024-04-24)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.4 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=es_ES.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=es_ES.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=es_ES.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C
time zone: Europe/Madrid
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] fgsea_1.30.0 GSVAdata_1.40.0
[3] hgu95a.db_3.13.0 GSEABase_1.66.0
[5] dplyr_1.1.4 GO.db_3.19.1
[7] GOstats_2.70.0 graph_1.82.0
[9] Category_2.70.0 Matrix_1.6-5
[11] org.Hs.eg.db_3.19.1 ggplot2_3.5.1
[13] gridExtra_2.3 sva_3.52.0
[15] BiocParallel_1.38.0 genefilter_1.86.0
[17] mgcv_1.9-1 nlme_3.1-163
[19] geneplotter_1.82.0 annotate_1.82.0
[21] XML_3.99-0.16.1 AnnotationDbi_1.66.0
[23] lattice_0.22-5 edgeR_4.2.0
[25] limma_3.60.2 SummarizedExperiment_1.34.0
[27] Biobase_2.64.0 GenomicRanges_1.56.0
[29] GenomeInfoDb_1.40.1 IRanges_2.38.0
[31] S4Vectors_0.42.0 BiocGenerics_0.50.0
[33] MatrixGenerics_1.16.0 matrixStats_1.3.0
[35] usethis_2.2.3 here_1.0.1
[37] kableExtra_1.4.0 knitr_1.47
[39] BiocStyle_2.32.0
loaded via a namespace (and not attached):
[1] bitops_1.0-7 DBI_1.2.2 RBGL_1.80.0
[4] rlang_1.1.3 magrittr_2.0.3 compiler_4.4.0
[7] RSQLite_2.3.7 png_0.1-8 systemfonts_1.1.0
[10] vctrs_0.6.5 stringr_1.5.1 pkgconfig_2.0.3
[13] crayon_1.5.2 fastmap_1.2.0 magick_2.8.3
[16] XVector_0.44.0 labeling_0.4.3 utf8_1.2.4
[19] rmarkdown_2.27 UCSC.utils_1.0.0 tinytex_0.51
[22] purrr_1.0.2 bit_4.0.5 xfun_0.44
[25] zlibbioc_1.50.0 cachem_1.1.0 jsonlite_1.8.8
[28] blob_1.2.4 highr_0.11 DelayedArray_0.30.1
[31] parallel_4.4.0 R6_2.5.1 bslib_0.7.0
[34] stringi_1.8.4 RColorBrewer_1.1-3 jquerylib_0.1.4
[37] Rcpp_1.0.12 bookdown_0.39 tidyselect_1.2.1
[40] splines_4.4.0 rstudioapi_0.16.0 abind_1.4-5
[43] yaml_2.3.8 codetools_0.2-19 tibble_3.2.1
[46] withr_3.0.0 KEGGREST_1.44.0 evaluate_0.23
[49] AnnotationForge_1.46.0 survival_3.5-8 xml2_1.3.6
[52] Biostrings_2.72.0 pillar_1.9.0 BiocManager_1.30.23
[55] KernSmooth_2.23-22 generics_0.1.3 RCurl_1.98-1.14
[58] rprojroot_2.0.4 munsell_0.5.1 scales_1.3.0
[61] xtable_1.8-4 glue_1.7.0 tools_4.4.0
[64] data.table_1.15.4 locfit_1.5-9.9 fs_1.6.4
[67] fastmatch_1.1-4 cowplot_1.1.3 grid_4.4.0
[70] colorspace_2.1-0 GenomeInfoDbData_1.2.12 cli_3.6.2
[73] fansi_1.0.6 S4Arrays_1.4.1 viridisLite_0.4.2
[76] svglite_2.1.3 Rgraphviz_2.48.0 gtable_0.3.5
[79] sass_0.4.9 digest_0.6.35 SparseArray_1.4.8
[82] farver_2.1.2 memoise_2.0.1 htmltools_0.5.8.1
[85] lifecycle_1.0.4 httr_1.4.7 statmod_1.5.0
[88] bit64_4.0.5